Comparison of the data mining and machine learning algorithms for predicting the final body weight for Romane sheep breed

The current study aimed to predict final body weight (the weight at four months of age, used to select future reproducers) from birth weight, birth type, sex, suckling weight, age at suckling weight, weaning weight, age at weaning weight, and age at final body weight in the Romane sheep breed. For this purpose, classification and regression tree (CART), multivariate adaptive regression splines (MARS), and support vector machine regression (SVR) algorithms were fitted on a training set (80%) and evaluated on a testing set (20%). Data from 393 Romane sheep (238 female and 155 male animals) were used. The best prediction model was obtained with the CART algorithm for both the training and testing sets. The constructed CART model indicated that sex, suckling weight, weaning weight, age at weaning weight, and age at final weight could be used as indirect selection criteria to obtain a flock with superior final body weight. If genetically established, Romane sheep whose sex is female, whose age at final weight is over 142 days, and whose weaning weight is over 28 kg could be chosen to achieve genetic improvement in final body weight. In conclusion, the CART procedure may be worth considering for identifying breed standards and choosing superior sheep for meat yield in France.


Introduction
Sheep are among the first domesticated animals, as shown by archaeological and genetic studies [1,2]. This domestication process has led to improvements in sheep breeding over time. Multi-purpose sheep breeding is noteworthy not only for the development of healthy civilizations but also for obtaining animal products such as milk and meat.

Materials and methods
Data on the Romane sheep breed were provided by INRA for the comparison of the algorithms. The Romane sheep (a Berrichon du Cher × Romanov cross) was developed to increase the productivity of the French sheep herd. INRA created the synthetic strain INRA 401 by crossing the Berrichon du Cher breed (good meat yield and quality, but not very prolific, not very maternal, white fleece) with the Romanov breed (very prolific and maternal, but poor butchering qualities and coloured fleece). The resulting line was named the Romane breed in 2006. It has an average litter size of 2 lambs per birth (for 3-year-old ewes), excellent out-of-season fertility, and good lamb viability at birth [29]. During the suckling period, ewes and lambs were housed in the barn; ewes were fed a total mixed ration composed of wrapped forage bales and concentrates, and lambs were fed concentrates ad libitum. This study used data from 393 animals (155 male and 238 female) born in August, September and October 2021. Sex, birth weight, suckling weight, age at suckling weight (about 30 days), weaning weight, age at weaning weight, and age at final body weight were used to estimate the final weight, that is, the weight at four months of age, which is used to select the future reproducers (rams and ewes) while they are still young and not yet mature adults.
Descriptive statistics were reported for the response and explanatory variables for each sex. The sexes were compared for each variable using the independent samples t-test (p<0.05).
The Classification and Regression Tree (CART) algorithm was proposed by Breiman et al. [30]. CART builds a binary tree: at each step, a node is split on one explanatory variable into two sub-nodes that are as homogeneous as possible. The process begins at the root node, which contains the whole data set, and continues until homogeneous sub-nodes that give the minimum error variance are obtained.
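The splitting step described above can be illustrated with a minimal Python sketch. It assumes the usual regression-tree criterion (choose the threshold that minimizes the summed within-node squared error); the function names and the toy weights are hypothetical and are not taken from the study data.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the node mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold t for the rule 'x < t' that minimizes the total
    within-node SSE of the two sub-nodes. Candidate thresholds are the
    midpoints between consecutive distinct values of x."""
    xs = sorted(set(x))
    best_threshold, best_error = None, sse(y)  # start from the unsplit node
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = sse(left) + sse(right)
        if total < best_error:
            best_threshold, best_error = t, total
    return best_threshold, best_error


# Hypothetical toy data: weaning weight (kg) and final weight (kg)
weaning = [20, 21, 22, 26, 27, 29]
final = [33, 34, 35, 41, 42, 44]
threshold, error = best_split(weaning, final)
print(threshold, round(error, 3))
```

A full CART implementation repeats this search over every explanatory variable and every resulting sub-node, and then prunes the tree using cross-validation.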
The Multivariate Adaptive Regression Splines (MARS) algorithm was proposed by Friedman [31] to address classification and regression problems. MARS is a regression procedure that gives a flexible description of interactions and of linear and nonlinear effects between explanatory and response variables. The MARS algorithm does not require the assumptions needed in linear regression [27,32-34].
The MARS algorithm generates basis functions in a stepwise procedure that considers all candidate knots and all possible interaction effects among the explanatory variables. The algorithm consists of two steps, a forward pass and a backward pass [35]. The forward pass starts from a model containing only the intercept and iteratively adds the basis functions that give the largest reduction in training error. The forward pass typically produces an overfitted model of excessive complexity [27,31]: it fits the training data very well but may generalize poorly. To resolve this problem, the backward pass removes the basis functions that contribute least to the prediction model [18,35,36]. The MARS algorithm is a useful tool that can capture both linear and nonlinear relationships between dependent and independent variables [37,38]. The MARS equation used to predict body weight (BW) from the explanatory variables is:
$$\hat{y} = \beta_0 + \sum_{m=1}^{M} \beta_m \prod_{k=1}^{K_m} h_{km}\left(X_{v(k,m)}\right)$$
where $\hat{y}$ is the predicted value of BW, $\beta_0$ is the intercept of the model, $M$ is the number of basis functions, $\beta_m$ is the coefficient of the $m$-th basis function, $K_m$ is the parameter that determines the degree of interaction, and $h_{km}(X_{v(k,m)})$ are the basis functions of the model.
Variable selection (the forward and backward passes) minimizes the generalized cross-validation (GCV) error and thus improves the performance of the model. GCV is calculated as [27,36,39]:
$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\left[1 - \frac{M(\lambda)}{n}\right]^2}$$
where $n$ is the size of the training set, $y_i$ is the observed value of the response variable (BW), $\hat{y}_i$ is the predicted value of the response variable (BW), and $M(\lambda)$ is the penalty term for the complexity of a model that includes $\lambda$ terms. Before the MARS procedure, the explanatory variables were tested for multicollinearity, and no multicollinearity problem was found. The MARS model was fitted to predict BW in the training set. Cross-validation was used to select the best MARS model among 72 candidate models (degree = 1:2 and nprune = 2:38); ten-fold cross-validation on the training set was used to determine the optimal MARS model.
An important branch of the support vector machine (SVM), a machine learning algorithm, is the support vector regression (SVR) algorithm [40]. The variant used for classification is called support vector classification (SVC), while the variant used for modeling and prediction is called SVR [33,41-43]. SVR is a supervised learning method, and its prediction performance depends on the training and test sets [33,44].
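The GCV criterion used for MARS model selection can be sketched in a few lines of Python. This is only an illustration under an assumed form of the criterion, GCV = sum of squared errors divided by (1 - M/n) squared, where M is the complexity penalty defined above; the function name and the numbers are hypothetical.

```python
def gcv(observed, predicted, penalty):
    """Generalized cross-validation error.

    Assumed form: sum((y_i - yhat_i)^2) / (1 - penalty / n)^2,
    where `penalty` plays the role of the complexity term M(lambda).
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and of equal length")
    if not 0 <= penalty < n:
        raise ValueError("penalty must satisfy 0 <= penalty < n")
    rss = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return rss / (1 - penalty / n) ** 2


# Hypothetical observed and predicted body weights (kg)
observed = [40.0, 42.0, 38.0, 45.0]
predicted = [41.0, 41.0, 39.0, 44.0]

# Same fit, different complexity penalties: the more complex model is penalized more
simple_model = gcv(observed, predicted, penalty=1)
complex_model = gcv(observed, predicted, penalty=2)
print(simple_model, complex_model)
```

For an identical residual sum of squares, the model with the larger penalty receives the larger GCV, which is why the backward pass favours simpler models unless the extra terms clearly reduce the error.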
The main goal of the linear SVR model is to find a function f(x) that deviates from the observed responses in the training set by at most ε. The training points therefore lie within a band bounded by −ε and +ε around f(x) [44]. In most applications, however, the relationship cannot be described adequately by a linear function. Therefore, in nonlinear SVR, the input data are mapped to a higher-dimensional Hilbert space (H) in which the regression function can be linear [40].
The hyperplane of the nonlinear SVR is:
$$f(x) = \langle w, \phi(x) \rangle + b$$
In this equation, $w$ is the weight vector, $\phi(x)$ is the nonlinear mapping associated with the kernel function, $\langle \cdot , \cdot \rangle$ denotes the vector inner product, and $b$ is the bias term. There are many nonlinear kernel functions; in the current study, the Gaussian radial basis kernel function was used.
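Two ingredients of the SVR model described above, the Gaussian radial basis kernel and the ε-band around f(x), can be sketched as follows. The kernel parameterization exp(-gamma * squared distance) is an assumption (it is the form commonly used by SVM libraries), and the function names and input vectors are hypothetical. The values gamma = 0.083 and epsilon = 0.1 are the ones reported in the Results of this study.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian radial basis kernel, assumed form exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y, f, epsilon):
    """Loss is zero inside the band [-epsilon, +epsilon] and linear outside it."""
    return max(0.0, abs(y - f) - epsilon)


gamma, epsilon = 0.083, 0.1

# Hypothetical standardized feature vectors for two animals
animal_a = [0.5, -1.2, 0.3]
animal_b = [0.4, -1.0, 0.1]
similarity = rbf_kernel(animal_a, animal_b, gamma)

# A prediction within epsilon of the observation costs nothing
inside_band = epsilon_insensitive_loss(40.0, 40.05, epsilon)
outside_band = epsilon_insensitive_loss(40.0, 41.0, epsilon)
print(similarity, inside_band, outside_band)
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, with gamma controlling how quickly it decays.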
The following goodness-of-fit criteria were used to compare model performance [27,28,33,36,45]:
1. Root-mean-square error (RMSE): $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$
2. Akaike information criterion (AIC)
3. Standard deviation ratio (SD ratio): $$SD_{ratio} = \frac{s_m}{s_d}$$
4. Global relative approximation error (RAE): $$\mathrm{RAE} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} y_i^2}}$$
6. Mean absolute percentage error (MAPE)
7. Pearson correlation coefficient (r)
8. Performance Index (PI)
where $n$ is the size of the training data set, $k$ is the number of parameters in the model, $y_i$ is the observed value of the response variable (BW), $\hat{y}_i$ is the predicted value of the response variable (BW), $s_d$ is the standard deviation of the response variable (BW), and $s_m$ is the standard deviation of the errors of the optimum model [33]. The RMSE, SD ratio, CV, PI, RAE, MAPE, AIC, r, and R² criteria were used to compare the performance of the models. The best model is the one with the smallest RMSE, SD ratio, CV, PI, RAE, MAPE, and AIC values and the highest r and R² values in both the training and test sets [46].
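The criteria listed above can be computed with a short Python sketch. RMSE, SD ratio, and RAE follow the definitions given in the text, and MAPE and Pearson's r use their standard textbook definitions. The AIC form (n ln(SSE/n) + 2k) and the PI form (relative RMSE in percent divided by 1 + r) are assumptions, since the corresponding equations are not reproduced here. The function name and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def goodness_of_fit(y, yhat, k):
    """Compute several goodness-of-fit criteria for observed y and predicted yhat.

    k is the number of model parameters (used only by AIC).
    """
    n = len(y)
    errors = [obs - pred for obs, pred in zip(y, yhat)]
    sse = sum(e ** 2 for e in errors)

    rmse = math.sqrt(sse / n)
    sd_ratio = stdev(errors) / stdev(y)              # s_m / s_d
    rae = math.sqrt(sse / sum(obs ** 2 for obs in y))
    mape = 100 / n * sum(abs(e / obs) for e, obs in zip(errors, y))

    # Pearson correlation between observed and predicted values
    my, mp = mean(y), mean(yhat)
    cov = sum((obs - my) * (pred - mp) for obs, pred in zip(y, yhat))
    var_y = sum((obs - my) ** 2 for obs in y)
    var_p = sum((pred - mp) ** 2 for pred in yhat)
    r = cov / math.sqrt(var_y * var_p)

    # ASSUMPTION: least-squares form of AIC
    aic = n * math.log(sse / n) + 2 * k
    # ASSUMPTION: PI defined as relative RMSE (%) divided by (1 + r)
    pi = (100 * rmse / my) / (1 + r)

    return {"RMSE": rmse, "SDratio": sd_ratio, "RAE": rae,
            "MAPE": mape, "r": r, "AIC": aic, "PI": pi}


# Hypothetical observed and predicted final body weights (kg)
observed = [40.0, 42.0, 38.0, 45.0]
predicted = [41.0, 41.0, 39.0, 44.0]
fit = goodness_of_fit(observed, predicted, k=2)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

Apart from r, smaller values indicate a better fit, which matches the selection rule stated above.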
All statistical analyses were performed using R software [47]. Descriptive statistics were used to summarize the data; they were computed for the explanatory and response variables with the "psych" package [48]. The CART and MARS algorithms were fitted with the "caret" package [49], and the support vector regression algorithm was fitted with the "e1071" package [50]. The "ehaGoF" package was used to compute the performance criteria of all constructed models [51].

Results
Descriptive statistics for all variables by sex (male and female), expressed as mean ± standard error, are given in Table 1.
Table 2 gives Pearson's correlation coefficients between the response and explanatory variables, except for age at suckling weight, age at weaning weight, and age at final weight. Final body weight had low to moderate correlations with birth weight, suckling weight, and weaning weight (0.29, 0.42, and 0.52, respectively). Among the explanatory variables, a relatively high correlation was found only between weaning weight and suckling weight. The other correlation coefficients were low or moderate, indicating weak or no relationships. Fig 1 shows the CART regression tree constructed to estimate final weight from the explanatory variables birth weight (bw), suckling weight (sw), age at suckling weight (asw), weaning weight (ww), age at weaning weight (aww), age at final weight (afw), and sex (1 = male, 2 = female).
The first split of the tree divided the Romane sheep into two groups by sex. On the left side of the diagram, for females, the mean final body weight was approximately 36 kg. The females were then divided into two groups by weaning weight (<23 kg and ≥23 kg). For females with a weaning weight ≥23 kg, the tree was split on age at final weight (<106 days), with a mean of 37 kg. For females with an age at final weight of 106 days or more, the tree was further split on weaning weight at 27 kg. According to Fig 1, for a female with weaning weight ≥23 kg, age at final weight ≥106 days, and weaning weight ≥27 kg, the predicted final weight was 45 kg. On the right side of the diagram, for males, the mean final body weight was approximately 42 kg. The males were divided into two groups by age at final weight at 142 days. For males with an age at final weight under 142 days, the next split was on suckling weight at 18 kg. For males with an age at final weight over 142 days and a weaning weight over 28 kg, the predicted final body weight was 64 kg. Table 3 gives the cross-validation results of the CART algorithm. The constructed CART regression tree had 12 terminal nodes (the size of the regression tree), with a relative error of 0.148 and a cross-validation error of 0.312, which means that the coefficient of determination (R²) and the cross-validation R² were fairly close to each other, at 0.852 (1 - 0.148) and 0.688 (1 - 0.312), respectively (Table 3). R² is estimated as 1 - relative error. The cross-validation error is the mean of the cross-validated prediction errors, and the cross-validation R² is calculated as 1 - cross-validation error.
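The conversion from the tree's error measures to R² values is simple arithmetic and can be checked directly. The snippet below is only a worked check of the figures reported for Table 3; the variable names are illustrative.

```python
# Error measures reported for the final CART tree (Table 3)
relative_error = 0.148
cv_error = 0.312

# R^2 on the training data and cross-validated R^2
r2 = round(1 - relative_error, 3)
cv_r2 = round(1 - cv_error, 3)

# Difference between the two, a rough indicator of optimism in the training fit
gap = round(r2 - cv_r2, 3)
print(r2, cv_r2, gap)
```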
Cross-validation standard deviation is the standard deviation of the cross-validated prediction errors (Table 3). Table 4 shows that final body weight can be explained by eight basis functions in the MARS prediction model. The first term of the MARS prediction model is the intercept, with a value of 36.770. The second term applies when the sex is female and has a negative coefficient of -4.369. The third term (suckling weight - 12.8) had a cutpoint of 12.8 kg and a coefficient of 0.814. The fourth term involved the age at suckling weight, with a cutpoint of 35. The seventh and last terms of the MARS prediction model involved the age at final weight, with a cutpoint of 133 days.
For estimating final body weight, the best MARS prediction model obtained allows breeders to make more precise herd management decisions, such as the necessary amount of feed, drug doses for Romane sheep, and the selling value of each sheep.
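How a MARS basis function enters the prediction can be illustrated with the three terms whose coefficients are stated above. This is a deliberately partial sketch: the remaining five basis functions of Table 4 are omitted, so the output is not the prediction of the fitted model. It also assumes that the third term is the hinge max(0, suckling weight - 12.8) and that the second term is an indicator for female animals; the function names are hypothetical.

```python
def hinge(x, knot):
    """MARS hinge basis function max(0, x - knot)."""
    return max(0.0, x - knot)


def partial_mars_prediction(is_female, suckling_weight):
    """Sum of the first three MARS terms only (intercept, sex, suckling weight).

    ASSUMPTION: the other five basis functions are left out, so this is an
    illustration of the model structure, not the study's full model.
    """
    intercept = 36.770
    sex_effect = -4.369 if is_female else 0.0
    suckling_effect = 0.814 * hinge(suckling_weight, 12.8)
    return intercept + sex_effect + suckling_effect


# Below the 12.8 kg cutpoint the hinge contributes nothing
male_light = partial_mars_prediction(is_female=False, suckling_weight=10.0)
# Above the cutpoint each extra kg of suckling weight adds 0.814 kg
female_heavy = partial_mars_prediction(is_female=True, suckling_weight=14.8)
print(male_light, round(female_heavy, 3))
```

The hinge makes the effect of suckling weight piecewise linear: it is zero up to the cutpoint and increases linearly beyond it.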
The SVR algorithm was first fitted to the training set. After training, the SVR model was evaluated for predicting the final body weight of Romane sheep. In the current study, the Gaussian radial basis kernel function was used. The reliability of the model depends on the choice of the parameters cost (C), gamma, and epsilon, which were set to 1, 0.083, and 0.1, respectively. Several values of these parameters were tried, and the results were affected by the values of C and epsilon; the selected values gave the most reliable model. To determine which variables were most influential, a sensitivity analysis was carried out to estimate the relative importance of the explanatory variables for final body weight. The results are given in Fig 2. According to Fig 2, age at final weight was the most influential variable for final body weight in the support vector regression algorithm, followed by weaning weight. Sex (male and female), suckling weight, age at weaning weight, and birth weight were also important in determining final body weight. The least influential variables were age at suckling weight, birth type (2, 3, and 4), and number of co-suckled lambs.
The goodness-of-fit criteria were used to compare all models. The model performance results for the CART, MARS, and SVR algorithms are given in Table 5. Table 5 shows that the best predictive model was obtained with the CART algorithm, which had the smallest RMSE, SD ratio, CV, PI, RAE, MAPE, and AIC values for both the training and testing sets. The highest R² value was also obtained with the CART algorithm.
In addition, the Pearson correlation coefficient between the observed and predicted values was 0.923 for the training set and 0.902 for the test set. Since the coefficient of variation (CV) values for both the training and testing sets in each model are below 30%, it is concluded that the results obtained from the applied models are reliable.

Discussion
Various statistical methods can be used to explain the relationship between body weight and other characteristics. The important point is to use an appropriate statistical method. Although several studies have estimated live weight in different breeds, no such study has been found for the Romane sheep breed. In the present study, male Romane sheep had a greater mean final weight than female Romane sheep (p<0.05).
Alonso et al. [43] used the SVR algorithm to predict carcass weight in Asturiana de los Valles beef cattle, using 390 measurements from 144 animals. According to their results, the best carcass weight prediction was obtained 150 days before slaughter. They found MAPE values of 4.12 and 4.91 for the training and test sets, respectively. These MAPE values were smaller than ours. The better performance in their study may be due to a sample size greater than ours.
Celik et al. [3] compared CART, CHAID, Exhaustive CHAID, MARS, MLP, and RBF for Mengali rams. The CART algorithm was identified as the best prediction model according to model comparison criteria such as R², RMSE and SD ratio. The CART and MARS algorithms used in our study were also used in that study. Celik et al. found that the CART algorithm was more reliable than the MARS algorithm [3], which supports our results. Iqbal et al. [52] examined the performance of random forest, regression tree, SVR, and gradient boosting machine algorithms using data obtained from Beetal goats. To predict body weight, they used measures such as sex and body length. Pearson's correlation, R², RMSE, MAPE, and MAE were used as model comparison criteria. Their results showed that the gradient boosting machine was the best model for estimating the body weight of Beetal goats and that the random forest regression algorithm was the second best. Although the SVR algorithm did not give the best results, the data structure also affects prediction performance.
Marco et al. [53] used the AdaBoost ensemble learning method and RFR on different data sets. Several machine learning algorithms (MLP, SVR, CART, kNN, and RFR) were applied to these data sets. Based on the results from most of the data sets, they stated that RFR is a reliable and successful algorithm. However, they indicated that the CART algorithm was also usable: whereas RFR was highly sensitive to parameter tuning, CART was not, which results in stable prediction performance.
Tırınk [54] compared several artificial intelligence methods (Multivariate Adaptive Regression Splines, Random Forest Regression, Bayesian Regularized Neural Network, and Support Vector Regression) for estimating body weight from biometric measurements in the Thalli sheep breed, using 270 female Thalli sheep. According to the results, the MARS algorithm gave the best prediction model, ahead of the Bayesian regularized neural network, random forest regression, and support vector regression. These results support ours, because the MARS model also gave more consistent results than the SVR algorithm in our study.
Tırınk et al. [33] compared the CART, SVR, and RFR artificial intelligence methods for estimating body weight from body measurements in crossbreds with different shares of the Suffolk and Polish Merino genotypes. To compare the estimation performance of the algorithms and determine the best model for estimating body weight, various body measurements as well as sex and birth type were assessed. According to the test set results, random forest regression was recommended over the CART and SVR algorithms. Tırınk et al. [33] found the SVR algorithm (R² = 0.714) to be more reliable than the CART algorithm (R² = 0.578). These findings are not compatible with our results: in our study, the CART algorithm (R² = 0.810) was better than the SVR algorithm (R² = 0.761). It should also be noted that the coefficients of determination differed considerably between the two studies. They used data from 344 animals, while our sample size was 393. Because these sample sizes were similar, sample size cannot be the reason for the differences between the algorithms; the structure of the data may be the reason.
Kumar and Kumar-Singh [55] compared the MLR, MARS, SVR, and RFR techniques in hydrological time-series modelling. Based on the RMSE and R² values, they reported that the SVR algorithm was superior and that it was applicable for predicting weekly pan evaporation values for the Ranichauri region. Their results are comparable with ours only for the MARS and SVR algorithms. They recommended SVR, whereas our results favoured MARS over SVR. However, the difference in the coefficient of determination was only 0.01, which is negligible, so MARS and SVR can be regarded as performing similarly, as in our results.
Komadja et al. [56] developed models to predict peak particle velocity (PPV) in opencast mines using the CART, MARS, and SVR algorithms. The models were developed using a record of 1001 real blast-induced ground vibrations, with ten corresponding blasting parameters, from 34 opencast mines and quarries in India and Benin. The MARS model outperformed the other models, with a lower error (RMSE = 0.227) and an R² of 0.951, followed by SVR (R² = 0.87), CART (R² = 0.74), and empirical predictors. Their results contrast with our findings, in which the CART algorithm was the best. Komadja et al. [56] also listed the results of 50 previous studies; this list showed no consensus on algorithm selection. The contrast in the results may be due to data type and sample size.
As the research discussed above shows, artificial intelligence has been used for many species and breeds. The extensive variation among previous studies has been attributed to the physiological stage of the animals, differences in raising systems, and the choice of statistical methods. When our study is compared with other results in terms of the selected goodness-of-fit criteria, the models used in this study give similar results. Proposing statistical methods for BW estimation from biometric traits is very important for animal production, characterization, and breeding purposes. The results obtained show that much more work needs to be done on this subject.

Conclusion
In conclusion, the CART method can be recommended to provide breeders and researchers with access to a superior population of Romane sheep for use in future research. The findings of the current study, in which goodness-of-fit criteria were used to choose the best model among CART and the other methods, demonstrated that data mining and machine learning algorithms can be successfully used to estimate body weight from other explanatory variables. Although there are some variations due to the breed factor, as mentioned in the Discussion, more accurate models can be developed by carrying out comparable research and by using alternative algorithms.